MortonNet: Self-Supervised Learning of Local Features in 3D Point Clouds
We present a self-supervised task on point clouds, in order to learn
meaningful point-wise features that encode local structure around each point.
Our self-supervised network, named MortonNet, operates directly on
unstructured/unordered point clouds. Using a multi-layer RNN, MortonNet
predicts the next point in a point sequence created by a popular and fast
space-filling curve, the Morton-order curve. The final RNN state (coined the
Morton feature) is versatile and can be used in generic 3D tasks on point clouds. In
fact, we show how Morton features can be used to significantly improve
performance (+3% for two popular semantic segmentation algorithms) in the task of
semantic segmentation of point clouds on the challenging and large-scale S3DIS
dataset. We also show how MortonNet trained on S3DIS transfers well to another
large-scale dataset, vKITTI, leading to an improvement of 3.8% over the
state-of-the-art. Finally, we use Morton features to train a much simpler and more stable
model for part segmentation in ShapeNet. Our results show that our
self-supervised task yields features that are useful for 3D segmentation
tasks and generalize well to other datasets.
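For intuition, the Morton ordering that generates these point sequences can be sketched in a few lines of NumPy: quantize each coordinate to a fixed-resolution grid, interleave the bits of the three axes into a single Z-order code, and sort the points by that code. This is a minimal sketch under assumptions (a 10-bit grid resolution, axis-aligned normalization), not the authors' implementation.

```python
import numpy as np

def part1by2(n: np.ndarray) -> np.ndarray:
    """Spread the low 10 bits of each integer so that two zero bits
    separate consecutive bits (the classic 3D Morton-encoding step)."""
    n = n.astype(np.uint64) & 0x000003FF
    n = (n ^ (n << 16)) & 0xFF0000FF
    n = (n ^ (n << 8)) & 0x0300F00F
    n = (n ^ (n << 4)) & 0x030C30C3
    n = (n ^ (n << 2)) & 0x09249249
    return n

def morton_order(points: np.ndarray, bits: int = 10) -> np.ndarray:
    """Return indices that sort an (N, 3) point array along the Morton curve."""
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins + 1e-9   # avoid division by zero
    grid = ((points - mins) / spans * (2**bits - 1)).astype(np.uint64)
    return np.argsort((part1by2(grid[:, 2]) << 2)
                      | (part1by2(grid[:, 1]) << 1)
                      | part1by2(grid[:, 0]))

pts = np.random.rand(2048, 3).astype(np.float32)
seq = pts[morton_order(pts)]  # traversal order; the RNN consumes sub-sequences of it
```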
RefineLoc: Iterative Refinement for Weakly-Supervised Action Localization
Video action detectors are usually trained using datasets with
fully-supervised temporal annotations. Building such datasets is an expensive
task. To alleviate this problem, recent methods have tried to leverage weak
labeling, where videos are untrimmed and only a video-level label is available.
In this paper, we propose RefineLoc, a novel weakly-supervised temporal action
localization method. RefineLoc uses an iterative refinement approach by
estimating and training on snippet-level pseudo ground truth at every
iteration. We show the benefit of this iterative approach and present an
extensive analysis of five different pseudo ground truth generators. We show
the effectiveness of our model on two standard action datasets, ActivityNet
v1.2 and THUMOS14. RefineLoc shows competitive results with the
state-of-the-art in weakly-supervised temporal localization. Additionally, our
iterative refinement process is able to significantly improve the performance
of two state-of-the-art methods, setting a new state-of-the-art on THUMOS14.
Comment: Accepted to WACV 2021. Project website: http://humamalwassel.com/publication/refinelo
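To make the iterative scheme concrete, the following is a schematic sketch of the refinement loop. The `model.fit` and `model.snippet_scores` interfaces are assumptions for illustration, and simple score thresholding stands in for one of the five pseudo ground truth generators analyzed in the paper.

```python
import numpy as np

def threshold_pseudo_gt(snippet_scores: np.ndarray, video_label: int,
                        thresh: float = 0.5) -> np.ndarray:
    """Simple pseudo-ground-truth generator: mark a snippet as foreground
    when its score for the video-level class exceeds `thresh`."""
    return (snippet_scores[:, video_label] > thresh).astype(np.int64)

def refine_loc(model, videos, video_labels, iterations: int = 5):
    """Schematic iterative-refinement loop: train, generate snippet-level
    pseudo labels from the current model, retrain on them, and repeat.
    `model.fit` and `model.snippet_scores` are assumed interfaces."""
    pseudo = [None] * len(videos)          # no snippet supervision at the start
    for _ in range(iterations):
        model.fit(videos, video_labels, snippet_targets=pseudo)
        pseudo = [threshold_pseudo_gt(model.snippet_scores(v), y)
                  for v, y in zip(videos, video_labels)]
    return model
```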
Diagnosing Error in Temporal Action Detectors
Despite the recent progress in video understanding and the continuous rate of
improvement in temporal action localization throughout the years, it is still
unclear how far (or close?) we are to solving the problem. To this end, we
introduce a new diagnostic tool to analyze the performance of temporal action
detectors in videos and compare different methods beyond a single scalar
metric. We exemplify the use of our tool by analyzing the performance of the
top-performing entries in the latest ActivityNet action localization challenge.
Our analysis shows that the most impactful areas to work on are: strategies to
better handle temporal context around the instances, improving the robustness
w.r.t. the instance absolute and relative size, and strategies to reduce the
localization errors. Moreover, our experimental analysis finds that the lack of
agreement among annotators is not a major roadblock to attaining progress in the
field. Our diagnostic tool is publicly available to keep fueling the minds of
other researchers with additional insights about their algorithms.
Comment: Accepted to ECCV 2018
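As a rough illustration of the kind of per-prediction analysis such a diagnostic tool performs, the sketch below buckets a predicted segment into a coarse error type based on temporal IoU and label agreement; the actual tool defines a finer-grained taxonomy beyond this.

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def classify_prediction(segment, label, ground_truths, thr=0.5):
    """Bucket one prediction into a coarse error type (simplified taxonomy)."""
    best = max(ground_truths, key=lambda g: tiou(segment, g["segment"]))
    overlap = tiou(segment, best["segment"])
    if overlap >= thr:
        return "true positive" if label == best["label"] else "wrong-label error"
    if overlap > 0:
        return "localization error" if label == best["label"] else "confusion error"
    return "background error"

gts = [{"segment": (10.0, 20.0), "label": "long jump"}]
print(classify_prediction((15.0, 30.0), "long jump", gts))  # -> localization error
```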
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Visual and audio modalities are highly correlated, yet they contain different
information. Their strong correlation makes it possible to predict the
semantics of one from the other with good accuracy. Their intrinsic differences
make cross-modal prediction a potentially more rewarding pretext task for
self-supervised learning of video and audio representations compared to
within-modality learning. Based on this intuition, we propose Cross-Modal Deep
Clustering (XDC), a novel self-supervised method that leverages unsupervised
clustering in one modality (e.g., audio) as a supervisory signal for the other
modality (e.g., video). This cross-modal supervision helps XDC utilize the
semantic correlation and the differences between the two modalities. Our
experiments show that XDC outperforms single-modality clustering and other
multi-modal variants. XDC achieves state-of-the-art accuracy among
self-supervised methods on multiple video and audio benchmarks. Most
importantly, our video model pretrained on large-scale unlabeled data
significantly outperforms the same model pretrained with full supervision on
ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best
of our knowledge, XDC is the first self-supervised learning method that
outperforms large-scale fully-supervised pretraining for action recognition on
the same architecture.
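A minimal sketch of the cross-modal supervision signal, assuming features have already been extracted by the audio and video encoders: cluster each modality with k-means and use the resulting assignments as classification targets for the other modality (the cluster count `k` here is an assumed hyperparameter).

```python
import numpy as np
from sklearn.cluster import KMeans

def xdc_pseudo_labels(audio_feats: np.ndarray, video_feats: np.ndarray,
                      k: int = 256) -> dict:
    """One round of cross-modal supervision: cluster each modality's
    features with k-means and hand the assignments to the *other*
    modality as classification targets."""
    audio_assign = KMeans(n_clusters=k, n_init=10).fit_predict(audio_feats)
    video_assign = KMeans(n_clusters=k, n_init=10).fit_predict(video_feats)
    # Train the video encoder to predict `audio_assign` and the audio
    # encoder to predict `video_assign`; re-extract features and repeat.
    return {"video_targets": audio_assign, "audio_targets": video_assign}
```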
ActivityNet Challenge 2017 Summary
The ActivityNet Large Scale Activity Recognition Challenge 2017 Summary:
results and challenge participants' papers.
Comment: 76 pages
The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary
The 3rd annual installment of the ActivityNet Large-Scale Activity
Recognition Challenge, held as a full-day workshop at CVPR 2018, focused on the
recognition of daily-life, high-level, goal-oriented activities from
user-generated videos such as those found in internet video portals. The 2018
challenge hosted six diverse tasks which aimed to push the limits of semantic
visual understanding of videos as well as bridge visual content with human
captions. Three out of the six tasks were based on the ActivityNet dataset,
which was introduced in CVPR 2015 and organized hierarchically in a semantic
taxonomy. These tasks focused on tracing evidence of activities in time in the
form of proposals, class labels, and captions. In this installment of the
challenge, we hosted three guest tasks to enrich the understanding of visual
information in videos. The guest tasks focused on complementary aspects of the
activity recognition problem at large scale and involved three challenging and
recently compiled datasets: the Kinetics-600 dataset from Google DeepMind, the
AVA dataset from Berkeley and Google, and the Moments in Time dataset from MIT
and IBM Research.
Comment: CVPR Workshop 2018 challenge summary